Abstract

This paper introduces constrained mixtures for continuous distributions, characterized by a 
mixture of distributions where each distribution has a shape similar to the base distribution 
and disjoint domains. This new concept is used to create generalized asymmetric versions 
of the Laplace and normal distributions, which are shown to define exponential families, 
with known conjugate priors, and to have maximum likelihood estimates for the original 
parameters, with known closed-form expressions. The asymmetric and symmetric normal 
distributions are compared in a linear regression example, showing that the asymmetric 
version performs at least as well as the symmetric one, and in a real world time-series 
problem, where a hidden Markov model is used to fit a stock index, indicating that the 
asymmetric version provides higher likelihood and may learn distribution models over states 
and transition distributions with considerably less entropy. 

Keywords: Asymmetric probability distribution, Exponential family, Hidden Markov
models, Maximum likelihood estimation, Mixture models


1. Introduction 


There is a plethora of probability distributions to fit the most diverse uses. However,
even with this abundance of distributions, some applications cannot be solved using them
directly, requiring the use of probabilistic graphs (Koller and Friedman, 2009), like mixture
models (McLachlan and Basford, 1988), hidden Markov models (Baum and Petrie, 1966),
or latent Dirichlet allocation (Blei et al., 2003), where a set of distributions is used to build
the joint probability distribution.

While these more complex models provide additional flexibility to describe the problem, 
they are still limited by the underlying distributions used. This motivates the search for 
new distributions to describe some data peculiarity, and one of particular interest is the 
asymmetry of the distribution. 


There are naturally asymmetric distributions, such as the lognormal distribution (Johnson et al.,
1994), but it is also possible to introduce asymmetry in symmetric distributions, like the
skew normal distribution (O'Hagan and Leonard, 1976) does. This distribution is able to
control the skewness of the normal distribution, at the cost of losing closed-form
expressions for the maximum likelihood estimates. Additionally, by modifying the shape of the
distribution, its original interpretability is also lost.

To keep the interpretability, which may be important when analyzing a fitted model, the 
shape of the distributions used must be maintained, such that the user can choose the ones 
he or she knows how to analyze. For instance, this is what happens with mixture models,
where the known base distributions just change their parameters and are weighted. 

In this paper, we introduce the concept of a constrained mixture of distributions for
continuous distributions, which differs from the traditional mixture in that, instead of each
distribution being defined in the whole domain and being able to overlap with the other
distributions, the domain is partitioned among the distributions. In this way, they are defined
only in their segment, and all of them are instances of the same underlying distribution
with different parameters that guarantee that the continuity of the original distribution is
kept. This allows weighting each segment and analyzing them separately, like one would do
with the distributions in a standard mixture model.

The constrained mixture is then used to create asymmetric versions of the Laplace and
normal distributions, where the symmetric versions are particular cases. These new
distributions are shown to define an exponential family when the partitions are known, which allows
them to be easily used in existing models designed to work with these kinds of distributions,
like in latent Dirichlet models (Banerjee and Shan, 2007) and co-clustering (Shan and Banerjee,
2008), and their conjugate priors, with closed-form expressions, are also given.


We also show for these new asymmetric distributions that, if the weight of each partition
is known, then the maximum likelihood estimates are known and their closed-form
expressions are provided. Furthermore, we provide a hill-climbing algorithm to fit the weight of
the partitions, which allows maximum likelihood estimates for all the parameters.

To show the power of introducing asymmetry to the normal distribution, two
applications are provided. The first is a simple linear regression example problem with asymmetric
noise used to gain insight into how the asymmetry affects the estimation and show
experimentally that the asymmetric likelihood is lower bounded by the symmetric likelihood.
The second is a hidden Markov model used to fit a real world stock index time-series, which
shows that the flexibility introduced by the asymmetry not only increases the likelihood,
but may also provide insight into the system and reduce its entropy.

This paper is organized as follows. Section 2 introduces the concept of constrained
mixtures, and the asymmetric versions of the Laplace and normal distributions are
introduced in Section 3. Section 4 proves optimality conditions for the maximum likelihood
estimates and provides their closed-form expressions. Section 5 compares the performance
of the asymmetric normal distribution with the symmetric version for one example and
one real world problem, showing the advantages of the new distribution. Finally, Section 6
summarizes the findings and indicates future research directions.


2. Constrained Mixture

A constrained mixture is a special kind of mixture of distributions characterized by the 
existence of only one underlying distribution so that the domain is split in disjoint segments. 
Each segment has its own distribution, which must be similar to the base distribution, 
that is, there are known parameters for the base distribution that provide the shape of 
the distribution in the segment. Moreover, the distributions must be continuous and the 
weights for each segment must be provided. 

Since a mixture of V > 2 distributions can be described as a mixture of 2 distributions,
where one of those is a mixture of V - 1 distributions itself, we will develop the equations
only for the base case of V = 2. This not only simplifies the problem, but also is associated
with the number of distributions used to create the asymmetric versions of the Laplace and
normal distributions.


Definition 1 (Constrained Mixture) Let $\phi(x; \theta)\colon \mathbb{R} \times \mathcal{D}(\theta) \to [0, \infty)$ be the continuous
probability density function (pdf) for some distribution $D$, where $\mathcal{D}(\cdot)$ is the domain of its
argument. Let $\phi_+(x; \mu, \theta) = \phi(x; \theta) I[x \ge \mu]$ and $\phi_-(x; \mu, \theta) = \phi(x; \theta) I[x < \mu]$, where $I[\cdot]$ is
the indicator function, be the partitions' distributions. Let $p \in (0,1)$ be a weight parameter.
Then the constrained mixture $D^*$ is described by a pdf $\psi(x; \mu, \theta, p)\colon \mathbb{R} \times \mathbb{R} \times \mathcal{D}(\theta) \times (0,1) \to [0, \infty)$
that satisfies the following constraints for all $\mu$, $\theta$, and $p$ in the domain:

Constraint 1 (Continuity) The pdf is continuous at $x = \mu$, which means that

$$\lim_{x \to \mu^+} \psi(x; \mu, \theta, p) = \lim_{x \to \mu^-} \psi(x; \mu, \theta, p).$$

Constraint 2 (Mixture) There are known functions $\Theta_\pm(\mu, \theta, p)\colon \mathbb{R} \times \mathcal{D}(\theta) \times (0,1) \to \mathcal{D}(\theta)$
and normalizing constant $Z \in (0, \infty)$ such that

$$\psi(x; \mu, \theta, p) Z = p\,\phi_-(x; \Theta_-(\mu, \theta, p)) + (1 - p)\,\phi_+(x; \Theta_+(\mu, \theta, p)).$$

Constraint 1 guarantees that the continuity of $\phi(\cdot)$ is preserved, while Constraint 2
builds a mixture that forces each segment of the new pdf $\psi(\cdot)$ to have the same structure
as the original pdf $\phi(\cdot)$, while also placing weight $p$ and $1 - p$ on the left and right sides of
the partition, respectively. The functions $\Theta_\pm(\cdot)$ perform the mapping from the constraint
parameters $\mu$ and $p$ and the underlying distribution parameters $\theta$ to a new set of parameters
$\Theta_\pm(\cdot)$ that are used in each side of the partition.

From Constraint 2 and the fact that $\psi(\cdot)$ is a pdf, two additional redundant constraints
can be defined, which will be used later to define auxiliary variables.

Constraint 3 (Volume) Since $\psi(\cdot)$ is a pdf, it has unitary volume:

$$\int_{-\infty}^{\infty} \psi(x; \mu, \theta, p)\,dx = 1.$$

Constraint 4 (Weighting) The mixture places weight $p$ in the left part of the distribution,
which can be written as:

$$\int_{-\infty}^{\mu} \psi(x; \mu, \theta, p)\,dx = p.$$

The sampling of the new distribution $D^*$ can be performed by sampling $u \sim U([0,1])$
from the uniform distribution, followed by sampling from the distribution $D'_-$ described
by the non-normalized pdf $\phi_-(\cdot)$ if $u < p$ or from $D'_+$, with non-normalized pdf $\phi_+(\cdot)$,
otherwise.

Moreover, if the split parameter $\mu$ is fixed and the base distribution $D$ defines an
exponential family, then the new distribution $D^*$ also defines an exponential family. An
exponential family is a set of probability distributions whose probability density functions
can be expressed as

$$f(x|\theta) = h(x) \exp\left(\eta(\theta)^T T(x) - A(\theta)\right), \qquad (1)$$

where $\theta$ are the parameters of the distribution and $h(x)$, $T(x)$, $\eta(\theta)$, and $A(\theta)$ are known
functions (Banerjee et al., 2005).

It is important to highlight that this result is not unexpected when using the constrained
mixture. From Constraint 2, if the split position $\mu$ is known, both sides behave like the
underlying distribution. Therefore, we expect the natural parameter $\eta$ to be produced
by stacking the natural parameters $\eta_-(\Theta_-)$ and $\eta_+(\Theta_+)$ for both sides. Moreover, the
sufficient statistics $T$ should be produced by stacking $T_- I[x < \mu]$ and $T_+ I[x \ge \mu]$, which are
the statistics for each side of the distribution.

We also note that we cannot hope that the full distribution, without fixed $\mu$, defines
an exponential family too. Since the data is partitioned by $\mu$, we cannot separate the data
and parameters to create the term $\eta(\theta)^T T(x)$ in Equation (1).


3. Asymmetric Distributions 

The constrained mixture defined in Section 2 can be used to create asymmetric versions
of distributions. In this section, we will introduce the asymmetric Laplace and normal
distributions, showing that the symmetric versions are particular cases with $p = 0.5$. Later,
in Section 4, we will also show how to optimize the parameters for these new distributions.
To avoid cluttering, some proofs for this section are presented in the Appendix.

To break the symmetry of these distributions, the separation parameter $\mu$ is placed at the
mode, usually also denoted by $\mu$. Therefore, the following sections use them interchangeably,
to avoid writing $\mu$ for the mixture and $\mu'$ for the underlying distribution.


3.1 Laplace Distribution 

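The Laplace distribution can be described by parameters $\theta = (\mu, \lambda)$ and pdf

$$\phi(x; \mu, \lambda) = \frac{\lambda}{2} \exp(-\lambda |x - \mu|). \qquad (2)$$

From this, we will build the asymmetric version and prove that it generalizes the Laplace
distribution.

Theorem 2 (Asymmetric Laplace) Let $p \in (0,1)$, $\lambda \in (0, \infty)$, and $\mu \in \mathbb{R}$ be given.
Then the pdf given by:

$$\psi(x; \mu, \lambda, p) = \begin{cases} \beta \exp(-\lambda \alpha (x - \mu)), & x \ge \mu \\ \beta \exp(\lambda \alpha^{-1} (x - \mu)), & x < \mu, \end{cases} \qquad (3)$$

where $\alpha = \sqrt{p/(1-p)}$, satisfies all constraints in Section 2.

Proof See Appendix. ■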


Corollary 3 (Symmetric Laplace) Let $\lambda \in (0, \infty)$ and $\mu \in \mathbb{R}$ be given. Let $\phi(\cdot)$ and
$\psi(\cdot)$ be defined as in Equations (2) and (3), respectively. Then the following holds:

$$\forall x \in \mathbb{R}, \quad \phi(x; \mu, \lambda) = \psi(x; \mu, \lambda, 0.5).$$

Figure 1: Asymmetric Laplace distribution for variable $p$ and $\mu = 0$. As $p$ gets smaller, less
density is placed on negative values.

Proof With $p = 0.5$, we have that $\alpha = 1$ and $\beta = \lambda/2$. Using these values in Equation (3),
we arrive at Equation (2). ■
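The corollary and the weighting constraint can be checked numerically with a minimal sketch. It assumes $\alpha = \sqrt{p/(1-p)}$ and $\beta = \lambda\sqrt{p(1-p)}$; the value of $\beta$ is not stated in this section and is taken here as the one that gives unit volume and reduces to $\lambda/2$ at $p = 0.5$. The function names and the midpoint-rule integrator are illustrative choices, not part of the paper.

```python
import math

def laplace_pdf(x, mu, lam):
    """Symmetric Laplace pdf, Equation (2)."""
    return 0.5 * lam * math.exp(-lam * abs(x - mu))

def asym_laplace_pdf(x, mu, lam, p):
    """Asymmetric Laplace pdf, Equation (3)."""
    alpha = math.sqrt(p / (1.0 - p))
    # Assumption: beta chosen so that the pdf integrates to one.
    beta = lam * math.sqrt(p * (1.0 - p))
    if x >= mu:
        return beta * math.exp(-lam * alpha * (x - mu))
    return beta * math.exp(lam * (x - mu) / alpha)

def left_mass(mu, lam, p, width=40.0, n=20000):
    """Midpoint-rule estimate of the probability mass on [mu - width, mu)."""
    h = width / n
    total = sum(asym_laplace_pdf(mu - (k + 0.5) * h, mu, lam, p) for k in range(n))
    return total * h
```

For example, `left_mass(0.0, 1.0, 0.2)` should be close to 0.2, in agreement with Constraint 4, and `asym_laplace_pdf(x, mu, lam, 0.5)` should agree with `laplace_pdf(x, mu, lam)`.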


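Corollary 4 (Asymmetric Laplace Exponential Family) Let $\mu \in \mathbb{R}$ be given. Then
the asymmetric Laplace pdf given by Equation (3) defines an exponential family with
functions

$$h(x) = 1, \qquad A(\lambda, p) = -\ln \beta, \qquad (4a)$$

$$T(x) = \begin{bmatrix} |x - \mu| I[x \ge \mu] \\ |x - \mu| I[x < \mu] \end{bmatrix}, \qquad \eta(\lambda, p) = \begin{bmatrix} -\lambda \alpha \\ -\lambda \alpha^{-1} \end{bmatrix}. \qquad (4b)$$

Proof Using these functions in Equation (1), we can verify that it matches Equation (3). ■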
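As a worked version of this verification, the functions of Corollary 4 can be substituted into Equation (1). The step uses only the fact that $|x - \mu| = x - \mu$ for $x \ge \mu$ and $|x - \mu| = -(x - \mu)$ for $x < \mu$:

$$\begin{aligned} h(x) \exp\left(\eta^T T(x) - A\right) &= \exp\left(-\lambda \alpha |x - \mu| I[x \ge \mu] - \lambda \alpha^{-1} |x - \mu| I[x < \mu] + \ln \beta\right) \\ &= \begin{cases} \beta \exp(-\lambda \alpha (x - \mu)), & x \ge \mu \\ \beta \exp(\lambda \alpha^{-1} (x - \mu)), & x < \mu. \end{cases} \end{aligned}$$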


Figure 1 shows the asymmetric Laplace pdf $\psi(\cdot)$ for different combinations of $x$ and $p$
with $\mu$ fixed to 0. It is clear that, with $p$ getting closer to 0, the density is more strict
on negative values, that is, they are less likely to occur. However, this also increases the
uncertainty of positive values, which exhibit a slower decay.
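The two-step sampling procedure of Section 2 can be sketched for the asymmetric Laplace as follows. The sketch assumes $\alpha = \sqrt{p/(1-p)}$, so that the left side is a Laplace with rate $\lambda\alpha^{-1}$ restricted to $x < \mu$ and the right side is a Laplace with rate $\lambda\alpha$ restricted to $x \ge \mu$; each restricted side is then an exponential distribution measured from $\mu$. The function name is hypothetical.

```python
import math
import random

def sample_asym_laplace(mu, lam, p, rng):
    """Draw one sample: pick a side with probability p, then sample that side."""
    alpha = math.sqrt(p / (1.0 - p))
    if rng.random() < p:
        # Left side: distance below mu is exponential with rate lam / alpha.
        return mu - rng.expovariate(lam / alpha)
    # Right side: distance above mu is exponential with rate lam * alpha.
    return mu + rng.expovariate(lam * alpha)
```

With a small $p$, most samples fall at or above $\mu$ and the positive side shows the slower decay described above.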

3.2 Normal Distribution 

The normal distribution can be described by parameters $\theta = (\mu, \sigma)$ and pdf

$$\phi(x; \mu, \sigma) = \frac{1}{\sigma} \Phi\left(\frac{x - \mu}{\sigma}\right), \qquad (5)$$

where

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right) \qquad (6)$$

is the pdf of the standard normal distribution.

Figure 2: Asymmetric normal distribution for variable $p$ and $\mu = 0$. As $p$ gets smaller, less
density is placed on negative values. (a) Surface of the pdf. (b) Contours of the pdf.

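Theorem 5 (Asymmetric Normal) Let $p \in (0,1)$, $\sigma \in (0, \infty)$, and $\mu \in \mathbb{R}$ be given. Let
$\Phi(\cdot)$ be defined as in Equation (6). Then the pdf given by:

$$\psi(x; \mu, \sigma, p) = \begin{cases} \beta\, \Phi\left(\dfrac{x - \mu}{\sigma \alpha^{-1}}\right), & x \ge \mu \\ \beta\, \Phi\left(\dfrac{x - \mu}{\sigma \alpha}\right), & x < \mu, \end{cases} \qquad (7)$$

where $\alpha = \sqrt{p/(1-p)}$, satisfies all constraints in Section 2.

Proof See Appendix. ■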


Corollary 6 (Symmetric Normal) Let $\sigma \in (0, \infty)$ and $\mu \in \mathbb{R}$ be given. Let $\phi(\cdot)$ and
$\psi(\cdot)$ be defined as in Equations (5) and (7), respectively. Then the following holds:

$$\forall x \in \mathbb{R}, \quad \phi(x; \mu, \sigma) = \psi(x; \mu, \sigma, 0.5).$$

Proof With $p = 0.5$, we have that $\alpha = 1$ and $\beta = 1/\sigma$. Using these values in
Equation (7), we arrive at Equation (5). ■
Corollary 7 (Asymmetric Normal Exponential Family) Let $\mu \in \mathbb{R}$ be given. Then
the asymmetric normal pdf given by Equation (7) defines an exponential family with
functions

$$h(x) = \frac{1}{\sqrt{2\pi}}, \qquad A(\sigma, p) = -\ln \beta, \qquad (8a)$$

$$T(x) = \begin{bmatrix} (x - \mu)^2 I[x \ge \mu] \\ (x - \mu)^2 I[x < \mu] \end{bmatrix}, \qquad \eta(\sigma, p) = \begin{bmatrix} -\dfrac{1}{2\sigma^2 \alpha^{-2}} \\ -\dfrac{1}{2\sigma^2 \alpha^{2}} \end{bmatrix}. \qquad (8b)$$

Proof Using these functions in Equation (1), we can verify that it matches Equation (7). ■


Figure 2 shows the asymmetric normal pdf for combinations of $x$ and $p$ with $\mu$ fixed
to 0. Just like the asymmetric Laplace distribution, $p$ values closer to 0 are more strict on
negative values, making the distribution more conservative in these cases.
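A minimal sketch of the asymmetric normal pdf is given below. It assumes $\alpha = \sqrt{p/(1-p)}$ and $\beta = 2\sqrt{p(1-p)}/\sigma$; this $\beta$ is not stated in this section and is taken as the value that gives unit volume and reduces to $1/\sigma$ at $p = 0.5$. Function names are illustrative.

```python
import math

def std_normal_pdf(z):
    """Standard normal pdf, Equation (6)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def asym_normal_pdf(x, mu, sigma, p):
    """Asymmetric normal pdf, Equation (7)."""
    alpha = math.sqrt(p / (1.0 - p))
    # Assumption: beta chosen so that the pdf integrates to one.
    beta = 2.0 * math.sqrt(p * (1.0 - p)) / sigma
    # Right side uses scale sigma / alpha, left side uses scale sigma * alpha.
    scale = sigma / alpha if x >= mu else sigma * alpha
    return beta * std_normal_pdf((x - mu) / scale)
```

Under these assumptions the left side carries mass $\beta \sigma \alpha / 2 = p$, so a small $p$ both narrows the left side and reduces its weight.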


4. Parameter Optimization 

Once the new distributions are defined, we are interested in adjusting their parameters to fit
some data set. However, mixture models involve latent variables, such as the indicator of
to which class a given sample belongs in standard mixture or the current state in hidden
Markov models. For an asymmetric distribution, the indicator is given deterministically
from $\mu$, since we just have to identify if the observed value is larger or smaller than the
parameter $\mu$. This parameter, in turn, depends on the weight $p$, which specifies how much
probability to give to each side of $\mu$.

This dependency between parameters makes the analysis and optimization process more
complicated and, in our development, we were not able to find a solution to optimize
$\theta$, $\mu$, and $p$ simultaneously while providing guarantees. However, if we fix either
$\mu$ or $p$, then we are able to find formulations to optimize the others.

Let $S = \{s_i\}, i \in \{1, 2, \ldots, N\}, s_i \in \mathbb{R}$, be a set of samples. Using Constraint 2, the
parameters' log-likelihood can be written as:

$$\ln \mathcal{L}(\mu, \theta, p | S) = -|S| \ln Z + \ln \mathcal{L}_p + \ln \mathcal{L}_\phi \qquad (9a)$$

$$\ln \mathcal{L}_p = |S_-| \ln p + |S_+| \ln(1 - p) \qquad (9b)$$

$$\ln \mathcal{L}_\phi = \sum_{s_i \in S_-} \ln \phi_-(s_i; \Theta_-(\cdot)) + \sum_{s_i \in S_+} \ln \phi_+(s_i; \Theta_+(\cdot)), \qquad (9c)$$

where $S_- = \{s_i \in S | s_i < \mu\}$ and $S_+ = \{s_i \in S | s_i \ge \mu\}$.

If we consider the parameter $p$ fixed, then the maximum likelihood problem for both
distributions has known optima, and they have closed-form expressions, as we will show in
Sections 4.1 and 4.2. Since $p$ is only one value, it can be optimized numerically, as described
in Section 4.3.

Alternatively, since both distributions were shown to define exponential families in
Section 3 when $\mu$ is fixed and exponential families have conjugate priors (Barndorff-Nielsen,
2014), then the new distributions must have conjugate priors. Moreover, the conjugate
prior's probability density function can be written as

$$p(\eta | \chi, \nu) = f(\chi, \nu) \exp\left(\eta^T \chi - \nu A(\eta)\right), \qquad (10)$$

where $\eta$ and $A(\eta)$ are the natural parameters and a function of them. In this section,
we will also find the priors and show that their structure is sound. With these priors
one could compute the posterior distribution over the parameters (Barndorff-Nielsen, 2014;
Bishop, 2006) or use the new distributions as part of a more complex model with intractable
closed form, using an approach such as variational inference (Blei et al., 2003) or Gibbs
sampling (Geman and Geman, 1984), since the best approximating posterior is the conjugate
prior.

Therefore, we provide two methods for optimizing the parameters, one where the
partition weight $p$ is defined and we compute the maximum likelihood, and one where the
partitions themselves are defined through a fixed $\mu$ and we can compute the full posterior
on the parameters. It is important to highlight that, since the symmetric distributions
are particular cases of the asymmetric ones, their likelihoods cannot be higher than the
asymmetric likelihoods for the same data set. All proofs for this section are presented
in the Appendix.


4.1 Laplace Distribution 

Using the functions defined in the constrained mixture and in the proof of Theorem 2, the
distribution-specific likelihood, given by Equation (9c), can be written as:

$$\ln \mathcal{L}_\phi = |S| \ln \lambda + (|S_+| - |S_-|) \ln \alpha + \lambda \left[ \alpha^{-1} \sum_{s_i \in S_-} (s_i - \mu) - \alpha \sum_{s_i \in S_+} (s_i - \mu) \right]. \qquad (11)$$

Using $\mathcal{L}_p$ from Equation (9b) and the second term in the previous equation, we can
verify that

$$|S_-| \ln p + |S_+| \ln q + (|S_+| - |S_-|) \ln \alpha \qquad (12a)$$

$$= \frac{|S_-|}{2} (\ln p + \ln q) + \frac{|S_+|}{2} (\ln p + \ln q) = \frac{|S|}{2} (\ln p + \ln q) \qquad (12b)$$

$$= -|S| H(Be(0.5)) - |S| D_{KL}(Be(0.5) \| Be(p)), \qquad (12c)$$

where $q = 1 - p$, $Be(p)$ is the Bernoulli distribution, $H(\cdot)$ is the entropy, and $D_{KL}(\cdot \| \cdot)$ is
the Kullback-Leibler divergence (Kullback and Leibler, 1951). Therefore, only the first and
third terms in Equation (11) change with $\mu$ and $\lambda$.

Moreover, the likelihood term that depends only on p decreases as p moves away from 
the symmetric version p = 0.5. This can be viewed as an implicit regularization of the 
asymmetry, since it comes directly from the distributions defined in Section [3] and reduces 
the likelihood as the asymmetry increases. Therefore, the distribution only becomes more 
asymmetric whenever the likelihood gain in data fitting is higher than the loss of becoming 
more asymmetric. 
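The identity in Equation (12) and the regularization effect can be checked numerically. The sketch below assumes $\alpha = \sqrt{p/(1-p)}$ and uses natural logarithms, so that $H(Be(0.5)) = \ln 2$; the function names are illustrative.

```python
import math

def asymmetry_term(n_minus, n_plus, p):
    """Left-hand side, Equation (12a): depends on p and the side counts."""
    q = 1.0 - p
    alpha = math.sqrt(p / q)
    return n_minus * math.log(p) + n_plus * math.log(q) + (n_plus - n_minus) * math.log(alpha)

def entropy_kl_form(n, p):
    """Right-hand side, Equation (12c): depends only on p and the total count."""
    q = 1.0 - p
    entropy = math.log(2.0)                                   # H(Be(0.5))
    kl = 0.5 * math.log(0.5 / p) + 0.5 * math.log(0.5 / q)    # D_KL(Be(0.5) || Be(p))
    return -n * entropy - n * kl
```

For any split of the samples, `asymmetry_term(n_minus, n_plus, p)` should match `entropy_kl_form(n_minus + n_plus, p)`, and the value is largest at `p = 0.5`, which is the implicit penalty on asymmetry discussed above.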
Theorem 8 (Asymmetric Laplace Optimality) Let $p \in (0,1)$ and $S = \{s_i\}, i \in
\{1, 2, \ldots, N\}, s_i \in \mathbb{R}$, be given. Let the pdf of the asymmetric Laplace distribution be given
by Equation (3). Then the likelihood has an optimum where the partition $\mu^*$ is given by the
weighted median, with samples in $S_-$ and $S_+$ weighted by $\alpha^{-1}$ and $\alpha$, respectively, and

$$\lambda^* = \frac{|S|}{\alpha \sum_{s_i \in S_+} (s_i - \mu^*) - \alpha^{-1} \sum_{s_i \in S_-} (s_i - \mu^*)},$$

where $S_- = \{s_i \in S | s_i < \mu^*\}$, $S_+ = \{s_i \in S | s_i \ge \mu^*\}$, and $\alpha = \sqrt{p/(1-p)}$.

Furthermore, let $\mu_1^*$ and $\mu_2^*$, $\mu_1^* < \mu_2^*$, be optimal partitions. Then there is no $s_i \in S$
such that $\mu_1^* < s_i < \mu_2^*$, that is, all optimal partitions induce the same sets $S_-$ and $S_+$.

Proof See Appendix. ■


It is important to highlight that we have to look at all possible partitions of $S$, compute
their optimal $\mu^*$ given by the median, and check whether it induces the same partition. Since
all optima induce the same partition, only one such median induces the partition used to
create it, with the other values falling outside the required interval $\max S_- < \mu^* \le \min S_+$.
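A brute-force sketch of this fit for a fixed $p$ is shown below. Instead of enumerating partitions and checking their weighted medians, it evaluates the weighted absolute deviation at every sample and keeps the best one; this is a simplification chosen for brevity, relying on the fact that this objective is piecewise linear in $\mu$ with breakpoints at the samples. It assumes $\alpha = \sqrt{p/(1-p)}$ and at least two distinct sample values; the function name is hypothetical.

```python
import math

def fit_asym_laplace(samples, p):
    """Maximum likelihood (mu, lambda) of the asymmetric Laplace for fixed p."""
    alpha = math.sqrt(p / (1.0 - p))

    def weighted_abs_dev(mu):
        # alpha * sum over S+ of (s - mu) minus (1/alpha) * sum over S- of (s - mu)
        right = sum(s - mu for s in samples if s >= mu)
        left = sum(mu - s for s in samples if s < mu)
        return alpha * right + left / alpha

    # The deviation is piecewise linear in mu, so a minimizer lies at a sample.
    mu = min(samples, key=weighted_abs_dev)
    lam = len(samples) / weighted_abs_dev(mu)
    return mu, lam
```

With `p = 0.5` the weights are equal and the fit reduces to the usual median and the inverse mean absolute deviation. The cost is quadratic in the number of samples; sorting the samples and accumulating weights would reduce it.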

Alternatively, if we consider $\mu$ fixed instead of $p$, we have shown in Section 3.1 that
the asymmetric Laplace defines an exponential family, which means that it has a conjugate
prior given by Equation (10), where $\eta$ and $A(\eta)$ are defined in Equation (4).

Theorem 9 (Asymmetric Laplace Conjugate Prior) Let the asymmetric Laplace
distribution be given by Equation (3), with exponential family functions given by Equation (4).
Then its conjugate prior probability density function is given by

$$f(p, \lambda; \nu, \chi) = G(\lambda \alpha; \nu, \chi_1)\, G(\lambda \alpha^{-1}; \nu, \chi_2)\, B(p; \nu'),$$

where

$$G(\lambda; \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \lambda^{\alpha - 1} \exp(-\lambda \beta)$$

is the gamma distribution, $\Gamma(\cdot)$ is the gamma function,

$$B(p; \alpha) = \frac{1}{B(\alpha, \alpha)} p^{\alpha - 1} (1 - p)^{\alpha - 1}$$

is the symmetric beta distribution, $B(\cdot)$ is the beta function, and $\alpha = \sqrt{p/(1-p)}$.

Proof See Appendix. ■



Since the prior for the Laplace distribution, in the format written in Equation (2),
is the gamma distribution, and the prior for $p$, which can be seen as a parameter in a
Bernoulli distribution deciding on which side of $\mu$ the data will fall, is a beta distribution, it
is reasonable to expect that the asymmetric Laplace prior has one gamma distribution for
each side and one beta distribution for the deciding parameters, with their hyperparameters
linked in a way that the final parameters always satisfy the conditions for a constrained
mixture.
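For a prior of the form in Equation (10), observing samples updates the hyperparameters by adding the sufficient statistics to $\chi$ and the number of samples to $\nu$, which is the standard conjugate update for exponential families. A minimal sketch for the asymmetric Laplace with fixed $\mu$, using $T(x)$ from Equation (4b), could look as follows; the function names and the tuple layout of $\chi$ are illustrative choices.

```python
def laplace_sufficient_stats(x, mu):
    """T(x) from Equation (4b): (|x - mu| I[x >= mu], |x - mu| I[x < mu])."""
    d = abs(x - mu)
    return (d, 0.0) if x >= mu else (0.0, d)

def posterior_hyperparameters(samples, mu, chi, nu):
    """Conjugate update: chi <- chi + sum of T(x_i), nu <- nu + number of samples."""
    chi1, chi2 = chi
    for x in samples:
        t1, t2 = laplace_sufficient_stats(x, mu)
        chi1 += t1
        chi2 += t2
    return (chi1, chi2), nu + len(samples)
```

For instance, starting from `chi = (1.0, 1.0)` and `nu = 1.0` with `mu = 0.0`, the samples `[-1.0, 2.0, 3.0]` add 5 to the first component, 1 to the second, and 3 to `nu`.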
4.2 Normal Distribution

Using the functions defined in the constrained mixture and in the proof of Theorem 5, the
distribution-specific likelihood, given by Equation (9c), can be written as:

$$\ln \mathcal{L}_\phi = C - |S| \ln \sigma + (|S_+| - |S_-|) \ln \alpha - \frac{\sum_{s_i \in S_-} (s_i - \mu)^2}{2 \sigma^2 \alpha^2} - \frac{\sum_{s_i \in S_+} (s_i - \mu)^2}{2 \sigma^2 \alpha^{-2}}, \qquad (13)$$

where $C$ is a constant.

Similarly to Equation (12), we can show that the term associated with $\ln \alpha$ does not
depend on the partition, once we consider $\mathcal{L}_p$. Therefore, only the other terms are used in
the optimization.


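Theorem 10 (Asymmetric Normal Optimality) Let $p \in (0,1)$ and $S = \{s_i\}, i \in
\{1, 2, \ldots, N\}, s_i \in \mathbb{R}$, be given. Let the pdf of the asymmetric normal distribution be given
by Equation (7). Then the likelihood has a single optimum, where the optimal partition is
given by

$$\mu^* = \frac{\alpha^{-2} \sum_{s_i \in S_-} s_i + \alpha^{2} \sum_{s_i \in S_+} s_i}{\alpha^{-2} |S_-| + \alpha^{2} |S_+|}$$

and

$$\sigma^{*2} = \frac{\alpha^{-2} \sum_{s_i \in S_-} (s_i - \mu^*)^2 + \alpha^{2} \sum_{s_i \in S_+} (s_i - \mu^*)^2}{|S|},$$

where $S_- = \{s_i \in S | s_i < \mu^*\}$, $S_+ = \{s_i \in S | s_i \ge \mu^*\}$, and $\alpha = \sqrt{p/(1-p)}$.

Proof See Appendix. ■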


Similarly to the asymmetric Laplace, we have to look at all partitions and check whether
the optimal $\mu^*$ is valid for that partition.
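The partition search can be sketched as follows for a fixed $p$. The samples are sorted, each split into the $k$ smallest samples ($S_-$) and the rest ($S_+$) yields a candidate $\mu^*$ from Theorem 10, and the candidate is accepted if it falls between the two groups. The sketch assumes $\alpha^2 = p/(1-p)$ and at least two distinct sample values; the function name and error handling are illustrative.

```python
import math

def fit_asym_normal(samples, p):
    """Maximum likelihood (mu, sigma) of the asymmetric normal for fixed p."""
    alpha2 = p / (1.0 - p)                      # alpha squared
    w_left, w_right = 1.0 / alpha2, alpha2      # weights of S- and S+
    s = sorted(samples)
    n = len(s)
    total = sum(s)
    left_sum = 0.0
    for k in range(n + 1):                      # k = number of samples in S-
        if k > 0:
            left_sum += s[k - 1]
        mu = (w_left * left_sum + w_right * (total - left_sum)) / (w_left * k + w_right * (n - k))
        lower_ok = k == 0 or s[k - 1] < mu      # every sample of S- is below mu
        upper_ok = k == n or mu <= s[k]         # every sample of S+ is at or above mu
        if lower_ok and upper_ok:
            sq = sum((w_left if x < mu else w_right) * (x - mu) ** 2 for x in s)
            return mu, math.sqrt(sq / n)
    raise ValueError("no self-consistent partition found")
```

With `p = 0.5` both weights equal one and the result is the sample mean and the (biased) sample standard deviation, as expected from the symmetric case.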

Also similarly to the asymmetric Laplace, if we consider $\mu$ fixed instead of $p$, we have
shown in Section 3.2 that the asymmetric normal defines an exponential family, which
means that it has a conjugate prior given by Equation (10), where $\eta$ and $A(\eta)$ are defined
in Equation (8).

Theorem 11 (Asymmetric Normal Conjugate Prior) Let the asymmetric normal
distribution be given by Equation (7), with exponential family functions given by Equation (8).
Then its conjugate prior probability density function is given by

$$f(p, \sigma; \nu, \chi) = I_g(\sigma^2 \alpha^{2}; \nu_2, \chi_2)\, I_g(\sigma^2 \alpha^{-2}; \nu_2, \chi_1)\, B(p; \nu_1),$$

where

$$I_g(S; \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} S^{-\alpha - 1} \exp\left(-\frac{\beta}{S}\right)$$

is the inverse gamma distribution, $\Gamma(\cdot)$ is the gamma function,

$$B(p; \alpha) = \frac{1}{B(\alpha, \alpha)} p^{\alpha - 1} (1 - p)^{\alpha - 1}$$

is the symmetric beta distribution, $B(\cdot)$ is the beta function, $\alpha = \sqrt{p/(1-p)}$, $\nu_1 = 1 + \nu/2$, and
$\nu_2 = \nu/4 - 1$.

Proof See Appendix. ■


Again, just like the asymmetric Laplace, the prior is in agreement with what is expected, 
since the prior for a variance is the inverse gamma distribution and the prior for p is a beta 
distribution. 

4.3 Asymmetry Parameter 

Sections 4.1 and 4.2 showed how $\mu$ and $\theta$ can be optimized in a closed form to maximize
the likelihood for a fixed $p$. Since $p$ is a single value, it can be optimized efficiently with a
hill-climbing algorithm.

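Given a value of $p$, the log-likelihood can be written as in Equation (9a). Let

$$L(p) = \begin{cases} \ln \mathcal{L}(\mu^*, \theta^*, p | S), & p \in (0,1) \\ -\infty, & \text{otherwise}, \end{cases}$$

where $\mu^*$ and $\theta^*$ are the optimal values for a given $p$. Let the initial estimate of $p$ be
$p^{(0)} = 0.5$, the initial step $\eta^{(0)} > 0$, the tolerance $\epsilon > 0$ and the adjustment $1 > \gamma > 0$ be
given. Then the hill-climbing algorithm works as follows:

1. Initialize $i = 0$ and $p^{(0)} = 0.5$.

2. Let $p_-^{(i)} = p^{(i)} - \eta^{(i)}$ and $p_+^{(i)} = p^{(i)} + \eta^{(i)}$.

3. If $\eta^{(i)} < \epsilon$, stop.

4. Let $L^{(i)} = L(p^{(i)})$, $L_-^{(i)} = L(p_-^{(i)})$, and $L_+^{(i)} = L(p_+^{(i)})$.

5. If $L_+^{(i)} > L^{(i)}$, then $p_-^{(i+1)} = p^{(i)}$, $p^{(i+1)} = p_+^{(i)}$, $p_+^{(i+1)} = p_+^{(i)} + \eta^{(i)}$, and $\eta^{(i+1)} = \eta^{(i)}$.
Go to step 4 with $i = i + 1$.

6. If $L_-^{(i)} > L^{(i)}$, then $p_+^{(i+1)} = p^{(i)}$, $p^{(i+1)} = p_-^{(i)}$, $p_-^{(i+1)} = p_-^{(i)} - \eta^{(i)}$, and $\eta^{(i+1)} = \eta^{(i)}$.
Go to step 4 with $i = i + 1$.

7. Let $p^{(i+1)} = p^{(i)}$ and $\eta^{(i+1)} = \gamma \eta^{(i)}$. Go to step 2 with $i = i + 1$.

This simple algorithm keeps the best estimate of $p$ at $p^{(i)}$ and compares it with its $p_\pm^{(i)}$
neighbors, moving in the direction that maximizes the likelihood. If the central estimate is
the best, the step is reduced and the process is repeated until convergence.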
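The procedure can be sketched compactly as below. This is a simplified variant: it recomputes the neighbor values at each iteration rather than carrying them over, and the default step, tolerance, and adjustment are arbitrary illustrative values. The callable `L` is assumed to return the profile log-likelihood for $p \in (0,1)$ and negative infinity otherwise; `example_L` is a stand-in objective used only to exercise the search.

```python
def hill_climb(L, p0=0.5, step=0.25, tol=1e-6, gamma=0.5):
    """Maximize L(p) by moving to a better neighbor or shrinking the step."""
    p = p0
    best = L(p)
    while step >= tol:
        right = L(p + step)
        if right > best:                 # the right neighbor improves the objective
            p, best = p + step, right
            continue
        left = L(p - step)
        if left > best:                  # the left neighbor improves the objective
            p, best = p - step, left
            continue
        step *= gamma                    # the center is best: reduce the step
    return p

def example_L(p):
    """Stand-in concave objective with its maximum at p = 0.3."""
    return -(p - 0.3) ** 2 if 0.0 < p < 1.0 else float("-inf")
```

In practice, `L` would fit $\mu^*$ and $\theta^*$ in closed form for the given $p$ and return the resulting log-likelihood.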

If the asymmetric distribution is part of a mixture, as in the example in Section 5.2,
then we must take certain precautions to avoid prematurely choosing a value of $p$. We have
found that fixing the value of $p$ to 0.5, such that the distribution behaves like its symmetric
version, until convergence of the likelihood, and then performing the hill-climbing every
time a maximum was being fit for the asymmetric distribution, thus allowing $p$ to change,
provided very good results and was able to avoid poor local optima due to premature commitment
to a value of $p$. Therefore, we first solve the symmetric problem until convergence, which
should have fewer local optima due to less flexibility, then use its estimated parameters as
initial conditions for the asymmetric problem, guaranteeing that the likelihood can only
increase.


5. Applications of the Proposed Asymmetric Distributions 


To demonstrate the characteristics of the new distributions, we propose two applications to
compare the symmetric and asymmetric versions: one toy example to understand the
fundamentals and one real world example to explore deeper characteristics of the distribution.
Since the normal distribution is frequently used, both applications will focus on it. 

A standard basic problem in machine learning is performing a linear regression to fit 
some data. Therefore the toy problem is composed of a linear regression, where the noise can 
be asymmetric. In this case, we will show that the asymmetric normal is able to consistently 
adapt to this asymmetry when it is present, providing higher likelihoods. 

We note that there are approaches that use asymmetric noise models, such as the
log-gamma distribution (Bianco et al., 2005), to perform the linear regression, but these other
distributions may be unknown to the user and may be difficult to interpret. However, the
normal distribution is very common and most people are familiar with it, which makes the
new asymmetric normal distribution a good candidate for noise model, since each side of
the partition can be interpreted as a normal distribution.

The real world problem is given by learning a time-series using a hidden Markov model, 
where the emission distributions have now the flexibility of being asymmetric. We will show 
that this extra flexibility not only increases the likelihood, but may be able to reduce the 
entropy of the model. 


5.1 Asymmetric Linear Regression 

The standard linear regression problem is defined by finding a parameter vector $\beta$
such that the relationship between an input $x$ and an output $y \in \mathbb{R}$ can be described
by

$$y = \beta^T \phi(x) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2),$$

where $\phi(x)$ is a function that computes features of the input and $\mathcal{N}(\mu, \sigma^2)$ is a
normal distribution with mean $\mu$ and variance $\sigma^2$ (Bishop, 2006). One of the basic choices
of $\phi(x)$ is the linear function, given by $\phi(x) = [x, 1]$, such that $\beta$ gives the slope and offset
of a straight line.

With the asymmetric normal distribution, introduced in Section 3.2, it is possible to
generalize this model to include asymmetric noise, such that the relationship between input
and output becomes

$$y = \beta^T \phi(x) + \epsilon, \qquad \epsilon \sim \mathcal{N}_a(0, \sigma^2, p),$$

where $\mathcal{N}_a(\mu, \sigma^2, p)$ is an asymmetric normal with partition $\mu$, underlying variance $\sigma^2$, and
weighting $p$.

Figure 3 shows an example of using the asymmetric normal with $\sigma = 0.1$ and $p = 0.1$.
The straight line is the noiseless relationship and the dots are the noisy samples obtained.

Figure 3: Linear example with asymmetric noise with $p = 0.1$.

Figure 4: Results of learning a linear regression model to data with asymmetric noise.
(b) Prediction of the asymmetry.

Since $p < 0.5$, the distribution creates fewer points with negative measurement errors and
makes the positive errors larger. From this image, it is clear that a symmetric normal is not
able to fit the noise well, since the region with high concentration of points is close to the
line, but it is concentrated on one side of the mean noise.

We performed 100 simulations for each value of $p \in \{0.1, 0.2, \ldots, 0.9\}$, where in each run
the values of $\beta$ were sampled uniformly in the interval $[-1, 1]$ and the underlying standard
deviation $\sigma$ was set to 0.1. The inputs, which were shared by all simulations, were given by
101 equidistant points between $-1$ and 1.
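A single simulated data set of this kind could be generated as in the sketch below. It assumes the asymmetric normal of Equation (7) with $\alpha = \sqrt{p/(1-p)}$, so that each side is a half-normal with scale $\sigma\alpha$ (left) or $\sigma\alpha^{-1}$ (right), and it uses the linear features $\phi(x) = [x, 1]$. The function names, the seed handling, and the return layout are illustrative and not taken from the paper.

```python
import math
import random

def sample_asym_normal(mu, sigma, p, rng):
    """Draw one asymmetric normal sample: pick a side, then a half-normal."""
    alpha = math.sqrt(p / (1.0 - p))
    z = abs(rng.gauss(0.0, 1.0))
    if rng.random() < p:
        return mu - z * sigma * alpha      # left side, scale sigma * alpha
    return mu + z * sigma / alpha          # right side, scale sigma / alpha

def simulate_regression(p, sigma=0.1, n_points=101, seed=0):
    """Generate inputs, noisy outputs, noise, and the true (slope, offset)."""
    rng = random.Random(seed)
    slope, offset = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    xs = [-1.0 + 2.0 * k / (n_points - 1) for k in range(n_points)]
    noise = [sample_asym_normal(0.0, sigma, p, rng) for _ in xs]
    ys = [slope * x + offset + e for x, e in zip(xs, noise)]
    return xs, ys, noise, (slope, offset)
```

With `p = 0.1`, roughly one tenth of the noise values are negative and the positive errors are spread more widely, matching the pattern described for Figure 3.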

Figure 4a shows the resulting likelihood of the fitted model, where the dashed line
represents equal likelihood. When $p = 0.5$, both models exhibit similar likelihoods, as we
expected since this case describes the symmetric normal distribution. Furthermore, since
the symmetric normal is a particular case of the asymmetric one, its likelihood cannot be
higher than the likelihood of the asymmetric normal. In fact, the asymmetric normal has
higher likelihood in all simulations performed. However, when we set $p = 0.1$ or 0.9, both
symmetric and asymmetric models have lower likelihood, with the asymmetric one fitting
better, as expected. The decrease in likelihood for the asymmetric model can be explained
in part by Equation (12c), where we have shown that the model loses likelihood by making
$p$ more distant from 0.5, while the decrease for the symmetric normal is due to incorrect
noise modelling.

It is important to highlight that, just like the terms $\ln \lambda$ in Equation (11) and $-\ln \sigma$
in Equation (13), which prevent the error terms in the same equations from having almost
no weight, the cost in Equation (12c) can be viewed as an implicit regularization that
prevents one side of the partition from having no weight, and this regularization is inherent to
the distributions defined in Equations (3) and (7) and is not artificially imposed.

Moreover, there is a similarity between the resulting likelihoods for p = 0.1 and p = 0.9. 
This is expected, since there is a similarity between the two, with p = 0.1 favoring positive 
noises as much as p = 0.9 favors negative ones. 

Figure |lb] shows the correct and predicted values of p, again with the dashed line rep¬ 
resenting the identity function, where predicted values p are represented by their mean and 
95% confidence interval. The mean prediction is clearly close to the true value, and the 
large variation of fitted weights p is due to the small number of samples, since the model 
is more flexible. However, when comparing the likelihood values for p = 0.5 in Figure l4al 
we see that the large spread of predicted values, from p = 0.35 to p = 0.65 approximately, 
does not interfere in the likelihood, as the value is similar to the normal that has p = 0.5. 

Although it might seem that the results in Figure 4b are not the maximum likelihood
estimates, since they may be far from the real parameter p used to create the noise, we
remind the reader that they may differ for a finite number of samples, just like any other
estimate. For instance, for M samples $x_i \sim \mathcal{N}(\mu, \sigma^2)$ drawn from the normal distribution,
the maximum likelihood estimate for the mean is given by $\hat{\mu} = \frac{1}{M} \sum_{i=1}^{M} x_i$, but this estimate
depends on the values of the specific $x_i$ sampled. If we consider the uncertainty on $x_i$, it can
be shown that the estimate is distributed as $\hat{\mu} \sim \mathcal{N}(\mu, \sigma^2/M)$ (Krishnamoorthy, 2006), which
specifies a random variable that only converges to the real value $\mu$ as $M \to \infty$. Therefore,
for a finite number of samples, the estimate $\hat{p}$ may differ from $p$ and still be a maximum
likelihood estimate.
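
The finite-sample behaviour of the sample mean described above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name and the values of `mu`, `sigma`, `M` and the number of trials are illustrative assumptions, not settings from the experiments.

```python
import random
import statistics

def sample_mean_estimates(mu, sigma, M, trials, seed=0):
    """Return `trials` maximum likelihood estimates of the mean,
    each computed from M independent normal samples."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(M))
        for _ in range(trials)
    ]

# Illustrative values only (assumptions, not taken from the paper).
mu, sigma, M = 1.0, 2.0, 25
estimates = sample_mean_estimates(mu, sigma, M, trials=20000)

# The estimates scatter around mu with variance close to sigma**2 / M.
print("mean of estimates:    ", statistics.fmean(estimates))
print("variance of estimates:", statistics.pvariance(estimates))
print("sigma**2 / M:         ", sigma**2 / M)
```

Each individual estimate differs from the true mean, yet every one of them is the maximum likelihood estimate for its own sample, which is the point made for the fitted weights.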

Therefore, we have shown that the asymmetric normal noise model is able to fit as well 
as the symmetric normal when the noise is indeed symmetric, and outperforms it when there 
is asymmetry in the noise. This motivates the use of the asymmetric normal distribution as 
a generalization of the normal distribution, thus being able to adapt to the observed noise 
asymmetry. 


5.2 Hidden Markov Model with Asymmetric Emissions 

While the creation of more flexible distributions by introducing the asymmetry is in itself 
interesting, with the possibility of fitting different data while keeping the interpretability, its 
use may also provide additional insights of practical relevance. To illustrate the application 
of the new distributions, we will use a hidden Markov model (HMM) to fit a time-series. 

An HMM with K states is defined by the initial distribution over the states $\pi$, the transition
matrix between states $T$, and the parameters of the distribution associated with each state,
$\theta_1, \ldots, \theta_K$. For the normal distribution, $\theta_i$ is given by $\mu_i$ and $\sigma_i$, while for the asymmetric
version, $p_i$ is also included. In this application, we will build two HMMs, one with only
symmetric and one with only asymmetric normal distributions.

To improve the initial estimates for the HMM, we first fit the data using a mixture model
with weights $w$ and with the same parameters for the emission distributions. Once the
expectation maximization algorithm has run for 100 iterations, we set $\pi = w$ and $T = w \mathbf{1}_{1 \times K}$,
such that every sample has the same prior probability over the emission distributions.
Additionally, for the asymmetric version, we first fit the samples, both for the mixture and
the HMM, using the method described in Section 4.3.
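
The initialization step above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, and the transition matrix is taken to be column-stochastic, so that every column equals the mixture weights and the next state does not depend on the current one.

```python
def init_hmm_from_mixture(w):
    """Build the initial HMM parameters (pi, T) from mixture weights w.

    Assumption: T[i][j] = P(next state = i | current state = j), so every
    column of T equals w and all samples share the same prior over states.
    """
    K = len(w)
    total = sum(w)
    w = [wi / total for wi in w]  # guard against rounding in the weights
    pi = list(w)                  # initial distribution over the states
    T = [[w[i] for _ in range(K)] for i in range(K)]
    return pi, T

# Hypothetical weights from a 3-component mixture fit.
pi, T = init_hmm_from_mixture([0.5, 0.3, 0.2])
for row in T:
    print(row)
```

With a row-stochastic convention the same initialization would simply be the transpose of this matrix.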

The data used was the Dow Jones Industrial Average index (DJI) from its first quotation,
on Jan 29, 1985, to its last quotation of 2014, on Dec 31, 2014, with the prices adjusted for
dividends and splits, where we consider that its value follows the lognormal distribution,
as usual in the economics field (Aitchison and Brown, 1957). Each sample is composed of
the logarithm of the return on investment (ROI) for consecutive days, that is, the sample is
given by $s_i = \log(v_{i+1}/v_i)$, where $v_i$ is the quotation on the i-th day. If either day of a pair
does not have a quotation, which happens if one of them falls on a weekend, for instance, then
that sample is considered missing. Therefore, the HMM has one state for each day between
those dates.
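
The construction of the samples, including the handling of days without a quotation, can be sketched as below. The function name and the use of `None` as the marker for a missing quotation are assumptions made for illustration, and the quotations in the example are made-up values, not DJI data.

```python
import math

def log_returns(quotes):
    """Compute s_i = log(v_{i+1} / v_i) for consecutive calendar days.

    `quotes` holds one entry per calendar day, with None when there is no
    quotation. A sample is None (missing) if either day of the pair is missing.
    """
    samples = []
    for today, tomorrow in zip(quotes, quotes[1:]):
        if today is None or tomorrow is None:
            samples.append(None)
        else:
            samples.append(math.log(tomorrow / today))
    return samples

# Made-up quotations: two days, a weekend without quotations, two more days.
quotes = [100.0, 101.0, None, None, 99.0, 100.5]
print(log_returns(quotes))
```

Keeping the missing samples in the sequence, rather than dropping them, preserves one time step per calendar day, as described in the text.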

The main motivation of using this kind of problem is that the hypothesis of symmetry
implied by the normal distribution may not reflect the reality. It is well known that stock
markets can have periods of very high or low return, which sometimes characterize bull
or bear markets (Edwards et al., 2013). Therefore, we expect to see improvements by
introducing an emission distribution that is able to exhibit such asymmetric behavior.

Table 1 shows the final log-likelihood for the samples with different numbers of possible
states K. As expected, using the asymmetric distribution provides greater likelihood due
to its additional flexibility. Moreover, increasing the number of states also increases the
difference in likelihood. Since the number of states in which the HMMs differ the most is
given by K = 5, the subsequent analysis will consider only this case.
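
The growth of the likelihood gap with the number of states can be read directly from the values reported in Table 1. The short sketch below only recomputes the differences; the variable names are illustrative.

```python
# Log-likelihoods from Table 1: K -> (symmetric, asymmetric).
table1 = {
    2: (19310.47, 19310.91),
    3: (19480.27, 19481.32),
    4: (19509.24, 19514.80),
    5: (19519.82, 19538.44),
}

# Gain in log-likelihood obtained by the asymmetric emissions for each K.
gains = {K: round(asym - sym, 2) for K, (sym, asym) in table1.items()}
for K, gain in gains.items():
    print(f"K = {K}: gain = {gain:.2f}")
```

The gain increases monotonically with K and is largest for K = 5, which is the case analyzed in the remainder of the section.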

Figure 5 shows the emission distributions for each HMM, with the mode dashed to
highlight the asymmetry. While some asymmetries are more subtle, like in components C4
(p = 0.476) and C2 (p = 0.510), others are more noticeable, like C1 (p = 0.612) and C5
(p = 0.570). In particular, the component C3 has the largest asymmetry of all, with p = 0.260.

Since the shape of the base distribution, in this case the normal distribution, has been
preserved on each side, the weight for each case can be used to provide some additional insight
into the state. For example, the state associated with the component C3 is considerably
certain that the index will rise (x > 0), which none of the emissions in the symmetric case
indicates.


Table 1: Parameters' log-likelihood

  K    Symmetric    Asymmetric
  2    19310.47     19310.91
  3    19480.27     19481.32
  4    19509.24     19514.80
  5    19519.82     19538.44


Table 2: Transition entropy (bits)

  Source state    Symmetric    Asymmetric
  1               0.1256       0.0847
  2               1.4972       0.4564
  3               0.4320       0.2108
  4               0.1815       0.1836
  5               1.1986       0.3565



Figure 5: Probability density function for each emission distribution. Panels: (a) Symmetric, (b) Asymmetric. Horizontal axis: x.


While the increased likelihood and the presence of asymmetry are expected from using 
a more general version of the distributions, other interesting and potentially useful results 
appear when we analyze the distribution over states. 

When we evaluate the transition probabilities for each state, shown in Figure 6, it
becomes very clear that the transitions for the asymmetric version are usually much less
ambiguous. To evaluate this quantitatively, Table 2 shows the entropy of the transitions
out of each state, with the maximum entropy being given by $\log_2 5 = 2.3219$ bits.
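
The entropy of the transitions out of a state, and its maximum for K = 5, can be computed as in the sketch below. The function name is illustrative and the example distributions are made up; they are not the fitted transition probabilities.

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete distribution.
    Zero-probability entries contribute nothing to the sum."""
    return sum(-p * math.log2(p) for p in probs if p > 0.0)

K = 5
# The uniform distribution over K target states attains the maximum entropy.
max_entropy = entropy_bits([1.0 / K] * K)
print(round(max_entropy, 4))

# A made-up, nearly deterministic transition distribution has low entropy.
print(entropy_bits([0.97, 0.01, 0.01, 0.005, 0.005]))
```

A state whose outgoing transitions concentrate on a single target has entropy close to zero, which is the sense in which the transitions are less ambiguous.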

Except for the fourth state, which suffered a minor increase in entropy of 1.2% and had
no noticeable difference in Figure 6, all other transitions had their entropy reduced considerably,
by 32.6% to 70.2%, with clear differences in the transition probabilities.
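
These relative changes follow from the entries of Table 2 and can be recomputed as sketched below. Since the table values are rounded to four decimal places, the recomputed percentages may differ from the quoted ones in the last digit.

```python
import math

# Transition entropies from Table 2, in bits: state -> (symmetric, asymmetric).
table2 = {
    1: (0.1256, 0.0847),
    2: (1.4972, 0.4564),
    3: (0.4320, 0.2108),
    4: (0.1815, 0.1836),
    5: (1.1986, 0.3565),
}
max_entropy = math.log2(5)

# Relative change of the entropy when moving to the asymmetric emissions.
change = {s: (asym - sym) / sym for s, (sym, asym) in table2.items()}
for s, c in change.items():
    print(f"state {s}: {100 * c:+.1f}%")

# Largest transition entropy of each model as a fraction of the maximum.
largest_sym = max(sym for sym, _ in table2.values()) / max_entropy
largest_asym = max(asym for _, asym in table2.values()) / max_entropy
print(f"symmetric: {100 * largest_sym:.1f}%, asymmetric: {100 * largest_asym:.1f}%")
```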

This reduced entropy also occurs in the states themselves, as shown in Figure 7.
Figure 7a shows the histogram of normalized entropies, where the normalized entropy is the entropy divided by the
maximum entropy, for both HMMs, with and without the states with missing data. In
both cases, the asymmetric version has considerably more states with lower entropy than the
symmetric version. Note also that the asymmetric version appears to suffer less from missing
data, while the symmetric version has a spike around 0.6 that does not occur when
these states are not considered.

Figure 6: Transition probabilities of the HMM states using the symmetric and asymmetric
distributions, with darker having higher probability. Panels: (a) Symmetric, (b) Asymmetric. Horizontal axis: Target.

Figure 7: Normalized entropy of the HMM states with and without the missing data. Panels: (a) Entropy histogram, (b) Entropy QQ plot.

To emphasize the difference, Figure 7b shows the entropy QQ plot, which plots the
normalized entropy quantiles of the states of one HMM against those of the other, with the dashed line
representing the identity. From this figure, we note that the symmetric HMM's states indeed
have higher entropy than the ones from the asymmetric, with the former reaching normalized
entropy 0.4 before the latter reaches 0.2, and a quantile with asymmetric distributions almost
always has less entropy than its symmetric equivalent, with the only exceptions being the
first few quantiles with very low entropy. Additionally, this figure also shows that the curve
that considers the missing data is close to the one that does not, also indicating that the
asymmetric version has good performance despite this lack of information.

6. Conclusion

In this paper, we have introduced the concept of a constrained mixture and provided two 
examples of how it can be used with the Laplace and normal distributions to create new 
asymmetric distributions. The new distributions were shown to generalize their underlying 
distribution while keeping important properties, such as belonging to the exponential family 
and having maximum likelihood estimates and conjugate priors with known closed-form 
expressions. Moreover, the distributions were shown to have an inherent regularization term, 
that is, a regularization that comes directly from the likelihood and not an imposed cost, 
that penalizes the asymmetry, such that the distribution avoids unnecessarily deforming 
the symmetric underlying distribution. 

One of the new distributions, the asymmetric normal distribution, was compared to the
symmetric version in a regression example with asymmetric noise. This allowed a better
understanding of how the asymmetric distributions operate and showed that, since the
symmetric versions are particular cases of the asymmetric distributions, the asymmetric
ones must attain a likelihood at least as high.

The asymmetric and symmetric normal distributions were also compared when used for 
emissions in a hidden Markov model (HMM) for a stock index. Results show that, as one 
would expect, the additional flexibility of the asymmetry allowed the distribution to better 
fit the data, providing increased likelihood and with larger differences as more states were 
used. 

A positive consequence of this flexibility and better fitting was additional certainty in 
the states and their transitions. We have shown that, when the HMM had 5 states, most 
probability distributions over the states had a considerable reduction in their entropy even 
when missing data is considered. Moreover, although one transition distribution, which 
already exhibited low entropy, had its entropy increased by 1.2%, all other transitions had 
reduced entropy, losing from 32.6% to 70.2% of their values, and the largest transition 
entropy is less than 20% of the maximum entropy, compared to 64.5% for the symmetric 
version. 

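Future investigations involve analyzing whether the maximum likelihood estimates and
conjugate priors of the Laplace and normal distributions still have known closed-form
expressions when the domain split does not occur at the mode. If so, the effect of using the
constrained mixture in other distributions of the exponential family and the use of multiple
segments should be investigated. Besides this theoretical research, the use of asymmetry to
characterize loss functions in machine learning is of interest, since it can make the system
focus more on predicting low or high values.

Appendix A. Proof of Theorem 2

Proof Constraint 1 is trivially satisfied, since both sides converge to $\mu$. From Constraint 3,
one has that

\[
\int_{-\infty}^{\infty} \psi(x; \mu, \lambda, p)\,dx
= \beta \left( \int_{-\infty}^{\mu} \exp\left(\lambda\alpha^{-1}(x - \mu)\right) dx + \int_{\mu}^{\infty} \exp\left(-\lambda\alpha(x - \mu)\right) dx \right)
= \beta\,\frac{\alpha^2 + 1}{\lambda\alpha} = 1,
\]

which is satisfied by the definition of $\beta$.
From Constraint 4, one has that

\[
\int_{-\infty}^{\mu} \psi(x; \mu, \lambda, p)\,dx
= \beta \int_{-\infty}^{\mu} \exp\left(\lambda\alpha^{-1}(x - \mu)\right) dx
= \frac{\lambda\alpha}{\alpha^2 + 1}\,\frac{\alpha}{\lambda}
= \frac{\frac{p}{1-p}}{\frac{p}{1-p} + 1} = p,
\]

which is satisfied by the definition of $\alpha$ and $\beta$.

Finally, to satisfy Constraint 2, let $\Theta_-(\mu, \lambda, p) = [\mu, \lambda\alpha^{-1}]$ and $\Theta_+(\mu, \lambda, p) = [\mu, \lambda\alpha]$.
Then

\[
\begin{aligned}
\psi(x; \mu, \lambda, p) &= \beta \exp\left(-\lambda\alpha(x - \mu)\right) \mathbb{I}[x \ge \mu] + \beta \exp\left(-\lambda\alpha^{-1}(\mu - x)\right) \mathbb{I}[x < \mu] \\
&= \frac{2\beta}{\lambda\alpha}\,\phi_+(x; \Theta_+(\cdot)) + \frac{2\beta}{\lambda\alpha^{-1}}\,\phi_-(x; \Theta_-(\cdot)) \\
&= \frac{2}{\alpha^2 + 1}\,\phi_+(x; \Theta_+(\cdot)) + \frac{2\alpha^2}{\alpha^2 + 1}\,\phi_-(x; \Theta_-(\cdot)) \\
&= 2(1 - p)\,\phi_+(x; \Theta_+(\cdot)) + 2p\,\phi_-(x; \Theta_-(\cdot)),
\end{aligned}
\]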

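which sets Z = 1/2.

Appendix B. Proof of Theorem 5

Proof Constraint 1 is trivially satisfied, since both sides converge to $\mu$. From Constraint 3,
one has that

\[
\begin{aligned}
\int_{-\infty}^{\infty} \psi(x; \mu, \sigma, p)\,dx
&= \frac{\beta}{\sqrt{2\pi}} \left( \int_{-\infty}^{\mu} \exp\left(-\frac{1}{2}\left(\frac{x - \mu}{\sigma\alpha}\right)^2\right) dx + \int_{\mu}^{\infty} \exp\left(-\frac{1}{2}\left(\frac{x - \mu}{\sigma\alpha^{-1}}\right)^2\right) dx \right) \\
&= \frac{\beta}{2} \left( \sigma\alpha \left[\operatorname{erf}\left(\frac{x - \mu}{\sqrt{2}\sigma\alpha}\right)\right]_{-\infty}^{\mu} + \sigma\alpha^{-1} \left[\operatorname{erf}\left(\frac{x - \mu}{\sqrt{2}\sigma\alpha^{-1}}\right)\right]_{\mu}^{\infty} \right) \\
&= \frac{\beta\sigma}{2}\,\frac{\alpha^2 + 1}{\alpha} = 1,
\end{aligned}
\]

which is satisfied by the definition of $\beta$, where erf(·) is the error function.
From Constraint 4, one has that

\[
\begin{aligned}
\int_{-\infty}^{\mu} \psi(x; \mu, \sigma, p)\,dx
&= \frac{\beta}{\sqrt{2\pi}} \int_{-\infty}^{\mu} \exp\left(-\frac{1}{2}\left(\frac{x - \mu}{\sigma\alpha}\right)^2\right) dx
= \frac{\beta\sigma\alpha}{2} \left[\operatorname{erf}\left(\frac{x - \mu}{\sqrt{2}\sigma\alpha}\right)\right]_{-\infty}^{\mu} \\
&= \frac{1}{2}\,\frac{2\alpha}{\sigma(\alpha^2 + 1)}\,\sigma\alpha
= \frac{\frac{p}{1-p}}{\frac{p}{1-p} + 1} = p,
\end{aligned}
\]

which is satisfied by the definition of $\alpha$ and $\beta$.

Finally, to satisfy Constraint 2, let $\Theta_-(\mu, \sigma, p) = [\mu, \sigma\alpha]$ and $\Theta_+(\mu, \sigma, p) = [\mu, \sigma\alpha^{-1}]$.
Then

\[
\begin{aligned}
\psi(x; \mu, \sigma, p) &= \frac{\beta}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x - \mu}{\sigma\alpha^{-1}}\right)^2\right) \mathbb{I}[x \ge \mu] + \frac{\beta}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x - \mu}{\sigma\alpha}\right)^2\right) \mathbb{I}[x < \mu] \\
&= \beta\sigma\alpha^{-1}\,\phi_+(x; \Theta_+(\cdot)) + \beta\sigma\alpha\,\phi_-(x; \Theta_-(\cdot)) \\
&= \frac{2}{\alpha^2 + 1}\,\phi_+(x; \Theta_+(\cdot)) + \frac{2\alpha^2}{\alpha^2 + 1}\,\phi_-(x; \Theta_-(\cdot)) \\
&= 2(1 - p)\,\phi_+(x; \Theta_+(\cdot)) + 2p\,\phi_-(x; \Theta_-(\cdot)),
\end{aligned}
\]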

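which sets Z = 1/2.

Appendix C. Proof of lemma for Theorem 8

Lemma 12 Let $p \in (0, 1)$ and $S = \{s_i\}$, $i \in \{1, 2, \ldots, N\}$, $s_i \in \mathbb{R}$, be given. Let the pdf of
the asymmetric Laplace distribution be given by Equation (3). Then the function

\[
\gamma(\mu) = \alpha \sum_{s_i \in S_+} (s_i - \mu) - \alpha^{-1} \sum_{s_i \in S_-} (s_i - \mu),
\]

where $S_- = \{s_i \in S \mid s_i < \mu\}$, $S_+ = \{s_i \in S \mid s_i \ge \mu\}$, and $\alpha = \sqrt{p/(1-p)}$, is convex.

Furthermore, let $\mu, \mu' \in \mathbb{R}$, $\mu < \mu'$. If there is some $s_i \in S$ such that $\mu < s_i < \mu'$, then
$\gamma(t\mu + (1 - t)\mu') < t\gamma(\mu) + (1 - t)\gamma(\mu')$ for all $t \in (0, 1)$.

Proof Let $t \in [0, 1]$ and $t' = 1 - t$. Let $\mu, \mu' \in \mathbb{R}$, $\mu < \mu'$. Let $\rho_i$ be a variable associated
with sample $s_i$, such that

\[
\rho_i = \alpha\,\mathbb{I}[s_i \ge t\mu + t'\mu'] - \alpha^{-1}\,\mathbb{I}[s_i < t\mu + t'\mu'],
\]

where $\mathbb{I}[\cdot]$ is the indicator function. Since $\alpha > 0$, we have that $\rho_i - \alpha \le 0$ and $\rho_i + \alpha^{-1} \ge 0$,
and $\rho_i - \alpha = 0 \Leftrightarrow \rho_i + \alpha^{-1} \ne 0$.

Let $S_- = \{s_i \in S \mid s_i < \mu\}$, $S_+ = \{s_i \in S \mid s_i \ge \mu\}$, $S'_- = \{s_i \in S \mid s_i < \mu'\}$, $S'_+ = \{s_i \in S \mid s_i \ge \mu'\}$,
and $S^* = S_+ \cap S'_-$. Then

\[
\begin{aligned}
\gamma(t\mu + t'\mu')
&= \alpha \sum_{s_i \in S'_+} (s_i - t\mu - t'\mu') - \alpha^{-1} \sum_{s_i \in S_-} (s_i - t\mu - t'\mu') + \sum_{s_i \in S^*} \rho_i (s_i - t\mu - t'\mu') \\
&= t\gamma(\mu) + t'\gamma(\mu') + \sum_{s_i \in S^*} \left( \rho_i (s_i - t\mu - t'\mu') - t\alpha(s_i - \mu) + t'\alpha^{-1}(s_i - \mu') \right) \\
&= t\gamma(\mu) + t'\gamma(\mu') + \sum_{s_i \in S^*} \left( t(s_i - \mu)(\rho_i - \alpha) + t'(s_i - \mu')(\rho_i + \alpha^{-1}) \right) \\
&\le t\gamma(\mu) + t'\gamma(\mu'),
\end{aligned}
\]

which proves that $\gamma(\mu)$ is a convex function.

Moreover, if there is some $\mu < s_i < \mu'$, then $s_i \in S^*$ and either $\rho_i - \alpha < 0$ or $\rho_i + \alpha^{-1} > 0$,
so that $\gamma(t\mu + t'\mu') < t\gamma(\mu) + t'\gamma(\mu')$ for all $t \in (0, 1)$. ■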
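
The convexity stated in Lemma 12 can be checked numerically. The sketch below evaluates the function of the lemma on synthetic samples and tests the convexity inequality on random pairs of points. The function name, the synthetic data, the value of p and the tolerance are assumptions for illustration, and a numerical check of this kind is no substitute for the proof.

```python
import math
import random

def gamma_laplace(mu, samples, p):
    """Function of Lemma 12 for the asymmetric Laplace distribution."""
    alpha = math.sqrt(p / (1.0 - p))
    right = sum(s - mu for s in samples if s >= mu)  # samples in S_+
    left = sum(s - mu for s in samples if s < mu)    # samples in S_-
    return alpha * right - left / alpha

rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(50)]  # synthetic data
p = 0.3

violations = 0
for _ in range(10000):
    a, b = rng.uniform(-4.0, 4.0), rng.uniform(-4.0, 4.0)
    t = rng.random()
    lhs = gamma_laplace(t * a + (1.0 - t) * b, samples, p)
    rhs = t * gamma_laplace(a, samples, p) + (1.0 - t) * gamma_laplace(b, samples, p)
    if lhs > rhs + 1e-9:  # small tolerance for floating-point error
        violations += 1
print("violations of the convexity inequality:", violations)
```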


Appendix D. Proof of Theorem 8

Proof From Equation (11), one can see that $\mu$ can be optimized independently of the
value of $\lambda$. Let $\gamma(\mu)$ be defined as in Lemma 12, such that

\[
\ln \mathcal{L} = C + |S| \ln \lambda - \lambda\gamma(\mu),
\]

where C is a constant. Therefore, the value $\mu^*$ that minimizes $\gamma(\mu)$ is the maximum
likelihood estimator. The function $\gamma(\mu)$ can be rewritten as

\[
\gamma(\mu) = \alpha \sum_{s_i \in S_+} |s_i - \mu| + \alpha^{-1} \sum_{s_i \in S_-} |s_i - \mu|,
\]

which is associated with the log-likelihood of the weighted scale-free Laplace distribution,
whose maximum likelihood estimate $\mu^*$ is given by the weighted median (Edgeworth),
with samples in $S_-$ and $S_+$ weighting $\alpha^{-1}$ and $\alpha$, respectively.

For $\lambda$, the optimal value is given by:

\[
\frac{\partial \ln \mathcal{L}}{\partial \lambda} = \frac{|S|}{\lambda} - \gamma(\mu^*) = 0,
\]

which solves for the value provided by the theorem.

From Lemma 12, we also know that there is no sample between two optima $\mu_1^*$ and
$\mu_2^*$, or there would be some $t \in (0, 1)$ such that
$\gamma(t\mu_1^* + (1 - t)\mu_2^*) < t\gamma(\mu_1^*) + (1 - t)\gamma(\mu_2^*) \le \max\{\gamma(\mu_1^*), \gamma(\mu_2^*)\}$,
which contradicts the optimality of $\mu_1^*$ or $\mu_2^*$. ■
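
For a fixed weight p, the estimates discussed in this proof can be sketched numerically. The sketch relies on the fact that the function of Lemma 12 is convex and piecewise linear with breakpoints at the samples, so a minimizer is attained at one of the samples. The brute-force search and the function names are illustrative assumptions, not necessarily the procedure stated in the theorem.

```python
import math

def gamma_laplace(mu, samples, p):
    """Function of Lemma 12 for the asymmetric Laplace distribution."""
    alpha = math.sqrt(p / (1.0 - p))
    right = sum(s - mu for s in samples if s >= mu)  # samples in S_+
    left = sum(s - mu for s in samples if s < mu)    # samples in S_-
    return alpha * right - left / alpha

def fit_asymmetric_laplace(samples, p):
    """Return (mu, lam) maximizing the likelihood for a fixed weight p.

    mu minimizes gamma over the samples. lam solves |S| / lam - gamma(mu) = 0,
    which requires gamma(mu) > 0, that is, the samples must not all be equal.
    """
    mu_star = min(samples, key=lambda m: gamma_laplace(m, samples, p))
    lam_star = len(samples) / gamma_laplace(mu_star, samples, p)
    return mu_star, lam_star

# With p = 0.5 the weights are equal and the estimate is the usual median.
print(fit_asymmetric_laplace([1.0, 2.0, 3.0, 4.0, 5.0], 0.5))
```

Between two consecutive samples the slope of the function is $-\alpha|S_+| + \alpha^{-1}|S_-|$, which vanishes when the fraction of samples below $\mu$ equals p, so for fixed p the location estimate behaves like an empirical p-quantile of the samples.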


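Appendix E. Proof of Theorem 9

Proof Using Equation (10), we have that the prior can be written as:

\[
\begin{aligned}
f(p, \lambda; \chi, \nu) &= C \exp\left(-\lambda\alpha\chi_1 - \lambda\alpha^{-1}\chi_2 + \nu \ln \beta\right) \\
&= C \exp\left(-\lambda\alpha\chi_1 - \lambda\alpha^{-1}\chi_2 + \nu\left(\ln \lambda + \frac{1}{2}\left(\ln p + \ln(1 - p)\right)\right)\right) \\
&= C' \exp\left(-\lambda\alpha\chi_1 - \lambda\alpha^{-1}\chi_2\right) \lambda^{\nu} \left(p(1 - p)\right)^{\nu/2} \\
&= C'' \exp\left(-\lambda\alpha\chi_1 - \lambda\alpha^{-1}\chi_2\right) \lambda^{\nu} B(p; \nu') \\
&= C'' \exp\left(-\lambda\alpha\chi_1 - \lambda\alpha^{-1}\chi_2\right) (\lambda\alpha)^{\nu/2} (\lambda\alpha^{-1})^{\nu/2} B(p; \nu') \\
&= G(\lambda\alpha; \nu', \chi_1)\, G(\lambda\alpha^{-1}; \nu', \chi_2)\, B(p; \nu'),
\end{aligned}
\]

where $\nu' = 1 + \nu/2$. ■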


Appendix F. Proof of lemma for Theorem 10

Lemma 13 Let $p \in (0, 1)$ and $S = \{s_i\}$, $i \in \{1, 2, \ldots, N\}$, $s_i \in \mathbb{R}$, be given. Let the pdf of
the asymmetric normal distribution be given by Equation (7). Then the function

\[
\gamma(\mu) = \alpha^{-2} \sum_{s_i \in S_-} (s_i - \mu)^2 + \alpha^{2} \sum_{s_i \in S_+} (s_i - \mu)^2,
\]

where $S_- = \{s_i \in S \mid s_i < \mu\}$, $S_+ = \{s_i \in S \mid s_i \ge \mu\}$, and $\alpha = \sqrt{p/(1-p)}$, is strictly convex.

Proof Let $f(x): \mathbb{R} \to \mathbb{R}$ be a function and $f^{(n)}(x)$ its n-th derivative. If $f(x)$ and $f'(x)$
are continuous and $f''(x) > 0$ for all x, then $f(x)$ is strictly convex.

For fixed $S_-$ and $S_+$, $\gamma(\mu)$ is a strictly convex quadratic function of $\mu$. If $\gamma(\mu)$ is
continuously differentiable and its derivative is monotonically increasing for variable $S_-$
and $S_+$, then $\gamma(\mu)$ is strictly convex.

Let $s^* = \min S_+$. The limit $\mu \to s^*$ is given by:

\[
\begin{aligned}
\lim_{\mu \to s^{*-}} \gamma(\mu)
&= \lim_{\mu \to s^{*-}} \left[ \alpha^{-2} \sum_{s_i \in S_-} (s_i - \mu)^2 + \alpha^{2} \sum_{s_i \in S_+} (s_i - \mu)^2 \right] \\
&= \alpha^{-2} \sum_{s_i \in S_- \cup \{s^*\}} (s_i - s^*)^2 + \alpha^{2} \sum_{s_i \in S_+ \setminus \{s^*\}} (s_i - s^*)^2 \\
&= \lim_{\mu \to s^{*+}} \gamma(\mu),
\end{aligned}
\]

which proves that $\gamma(\mu)$ is continuous. Its derivative is given by:

\[
\gamma'(\mu) = -2\alpha^{2} \sum_{s_i \in S_+} (s_i - \mu) - 2\alpha^{-2} \sum_{s_i \in S_-} (s_i - \mu), \tag{14}
\]

and we can prove that it is continuous using the same method as before.

Since $\gamma''(\mu) \ge 2|S| \min\{\alpha^{2}, \alpha^{-2}\} > 0$, the derivative $\gamma'(\mu)$ is monotonically increasing
and $\gamma(\mu)$ is strictly convex. ■
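
Strict convexity and Equation (14) suggest a simple numerical procedure for a fixed weight p, sketched below. For every possible split of the sorted samples into $S_-$ and $S_+$, the first-order condition gives a candidate location, and the candidate with the smallest value of the function is kept. The scale estimate assumes a log-likelihood of the form $C - |S| \ln \sigma - \gamma(\mu)/(2\sigma^2)$, whose derivative with respect to $\sigma$ vanishes at $\sigma^2 = \gamma(\mu^*)/|S|$. The function names and the enumeration strategy are illustrative assumptions, not necessarily the expressions stated in the theorem.

```python
import math

def gamma_normal(mu, samples, p):
    """Function of Lemma 13 for the asymmetric normal distribution."""
    alpha2 = p / (1.0 - p)  # alpha squared
    left = sum((s - mu) ** 2 for s in samples if s < mu)    # samples in S_-
    right = sum((s - mu) ** 2 for s in samples if s >= mu)  # samples in S_+
    return left / alpha2 + alpha2 * right

def fit_asymmetric_normal(samples, p):
    """Return (mu, sigma) maximizing the likelihood for a fixed weight p."""
    s = sorted(samples)
    n = len(s)
    alpha2 = p / (1.0 - p)
    candidates = []
    for k in range(n + 1):  # assume the k smallest samples lie below mu
        num = sum(s[:k]) / alpha2 + alpha2 * sum(s[k:])
        den = k / alpha2 + alpha2 * (n - k)
        candidates.append(num / den)  # stationary point of Equation (14)
    # The true minimizer is among the candidates, so keep the best one.
    mu_star = min(candidates, key=lambda m: gamma_normal(m, s, p))
    sigma_star = math.sqrt(gamma_normal(mu_star, s, p) / n)
    return mu_star, sigma_star

# With p = 0.5 this reduces to the sample mean and standard deviation.
print(fit_asymmetric_normal([1.0, 2.0, 3.0, 4.0, 10.0], 0.5))
```

The enumeration costs a quadratic number of operations in the number of samples. Prefix sums would reduce this cost, at the price of a longer listing.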


Appendix G. Proof of Theorem 10

Proof From Equation (13), one can see that $\mu$ can be optimized independently of the value
of $\sigma$. Let $\gamma(\mu)$ be defined as in Lemma 13, such that

\[
\ln \mathcal{L} = C - |S| \ln \sigma - \frac{\gamma(\mu)}{2\sigma^2},
\]

where C is a constant. Therefore, the value $\mu^*$ that minimizes $\gamma(\mu)$ is the maximum
likelihood estimator. And, since $\gamma(\mu)$ is strictly convex, this value is unique.

From the first order optimality condition, we can solve Equation (14) to find the optimal
$\mu^*$ stated in the theorem. For $\sigma$, the optimal value is given by:

\[
\frac{\partial \ln \mathcal{L}}{\partial \sigma} = -\frac{|S|}{\sigma} + \frac{\gamma(\mu^*)}{\sigma^3} = 0,
\]


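which solves for the value provided by the theorem.

Appendix H. Proof of Theorem 11

Proof Using Equation (10), we have that the prior can be written as:

\[
\begin{aligned}
f(p, \sigma; \chi, \nu) &= C \exp\left(-\frac{\chi_1}{2\sigma^2\alpha^{-2}} - \frac{\chi_2}{2\sigma^2\alpha^{2}} + \nu \ln \beta\right) \\
&= C \exp\left(-\frac{\chi_1}{2\sigma^2\alpha^{-2}} - \frac{\chi_2}{2\sigma^2\alpha^{2}} + \nu\left(\ln 2 - \ln \sigma + \frac{1}{2}\left(\ln p + \ln(1 - p)\right)\right)\right) \\
&= C' \exp\left(-\frac{\chi_1}{2\sigma^2\alpha^{-2}} - \frac{\chi_2}{2\sigma^2\alpha^{2}}\right) \sigma^{-\nu} \left(p(1 - p)\right)^{\nu/2} \\
&= C'' \exp\left(-\frac{\chi_1}{2\sigma^2\alpha^{-2}} - \frac{\chi_2}{2\sigma^2\alpha^{2}}\right) \sigma^{-\nu} B(p; \nu_1) \\
&= C'' \exp\left(-\frac{\chi_1}{2\sigma^2\alpha^{-2}} - \frac{\chi_2}{2\sigma^2\alpha^{2}}\right) (\sigma^2\alpha^{-2})^{-\nu/4} (\sigma^2\alpha^{2})^{-\nu/4} B(p; \nu_1) \\
&= IG(\sigma^2\alpha^{-2}; \nu_2, \chi'_1)\, IG(\sigma^2\alpha^{2}; \nu_2, \chi'_2)\, B(p; \nu_1),
\end{aligned}
\]

where $\nu_1 = 1 + \nu/2$, $\nu_2 = \nu/4 - 1$ and $\chi'_i = \chi_i/2$.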